Abstract

Assume we have two bijective functions $U(x)$ and $M(x)$ with
$M(x) \neq U(x)$ for all $x$ and $U, M : \mathbb{N} \to \mathbb{N}$.
Every day and in different locations, we see the different results of
$U$ and $M$ without seeing $x$. We are assured of neither the time
stamp nor the order within the day, but at least the location is fully
defined. We want to find the matching between $U(x)$ and $M(x)$ (i.e.,
we will not know $x$). We formulate this problem as adaptive graph
mining: we develop the theory, the solution, and the implementation.
This work stems from a practical problem, thus our definitions. The
solution is simple and clear, and the implementation is parallel and
efficient. In our experience, the problem and the solution are novel
and we want to share our findings.

1 Introduction 

Let us start by introducing the problem through its practical case.
We are traveling with our smartphone. We take a taxi and go to the
airport. We surf using our data plan. We arrive at the airport,
connect to the local WiFi, and surf. Before boarding we turn off our
phone. We land and the previous process restarts. During our surfing,
our phone is identified by a unique number that is a function of the
device and application (i.e., UUID). While we are using the WiFi, our
device also has a MAC address and an IP. If we have the distinct sets
of MACs and UUIDs, can we find the match: which UUID is associated
with which MAC?

If we identify our phone as $x$, we have two deterministic functions:
function $U(x, t, \ell)$, with location $\ell$ and time $t$, that
identifies our unique device, and function $M(x, t, \ell)$, with
location and time, that identifies the MAC. We have only a sample in
time of $U(x, t, \ell)$ and a sample by location of $M$. In practice,
we may not obtain $U(x)$ and $M(x)$ at the same time but within a
reasonable interval of time, say one day: for example, at a specific
airport and date (day) we may have either one but not both, with no
specific time information besides the day. Also, given $x$ we may have
a set $S$ of values of $U(x)$; that is, $U(x)$ is not unique and it
may be the composition of a set of exclusive functions $U_i(x)$, but
when possible we enforce a deterministic and unique result.

The problem boils down to answering the following question: if we are
observing the outputs of $U$ and $M$, can we guess $x$, which is
associated with $U(x)$ and $M(x)$?

* pdalberto@ninthdecimal.com

tvmilenkiy@ninthdecimal.com

We define an airport as $L_i$ with $i \in [0, N-1]$. There are $N$
airports and we enumerate them. We describe the first day we observe
events simply as $t_0$. Thus $t_1$ is the second day: this implies
that day $t_i$ precedes $t_{i+1}$.

Let us start by considering the first day $t_0$. For every $L_i$
there is a set of associated MAC addresses; we identify this set
as $M_{L_i}^{t_0}$. Also, we determine the users within a one-mile
radius of $L_i$. We identify this set as

(1.1)  $S_{L_i}^{t_0} = \{u : \mathrm{dist}(u, L_i) < 1 \text{ at time } t_0\}$

The user set is not complete because we have only a sample of the
available impressions: we sample in time the values of
$U(x, t, \ell)$, we cannot keep an ordered time sequence finer than a
day granularity, and by construction we may cover only a small area
of the airport $L_i$.

In practice, we associate the departing addresses to the departing
users. This is the mapping we would like to refine as much as
possible, until we have a one-to-one matching. That is, we can infer
the hidden $x$ that determines the unique mapping between $U(x)$ and
$M(x)$. These same users land at different $L_j$ with $i \neq j$, and
thus different addresses and mappings may be given.

Every mapping describes a graph, a fully connected bipartite graph.
We need to combine all mappings as above in order to achieve our
goal. This is an adaptive graph algorithm: we build a graph step by
step, day by day. We check whether mappings in different graphs
intersect, in which case we can split the graph by cutting edges.
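
As a minimal sketch of the data structure, a mapping can be held in R as a list with a user set S and an address set M, the same field names used by the implementation in Section 4. The helper names `mapping`, `edges`, and `is_match` and the sample identifiers are ours, introduced only for illustration.

```r
# A mapping (S : M) stands for the complete bipartite graph between
# the user set S and the address set M (hypothetical helper names).
mapping <- function(S, M) list(S = unique(S), M = unique(M))

# Number of candidate user-address edges implied by a mapping.
edges <- function(m) length(m$S) * length(m$M)

# A match is a mapping refined down to one user and one address.
is_match <- function(m) length(m$S) == 1 && length(m$M) == 1

m <- mapping(c("u1", "u2", "u3"), c("a1", "a2"))
print(edges(m))      # 3 users times 2 addresses
print(is_match(m))
```

Refining a mapping means cutting edges of this complete bipartite graph until, ideally, a single edge remains.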

The final goal is to grind these mappings into matches, where one
user is associated with one MAC, or at least into the finest
refinement possible. In the following, we formulate our solution
using the same notation, and we present our algorithm, a few
simplifications, and our results.

2 The algorithm 

Consider the mappings for the first day: 

These can be considered as the departing addresses for the
departing users. The users departing from $L_i$ can be the users
landing at $L_j$ with $j \neq i$. If there is an intersection between
addresses, we could refine the mapping:
Here the + operation is the disjoint concatenation of mappings,
binding fewer elements and refining them towards matches. Our
interpretation of Equation 2.2 follows: if there is any landing
information for the mapping between departing users and departing
addresses, then we can refine the mapping into two major components.
Equations 2.3 and 2.2 (part one) refer to the departing users
without landing information.
If there is an intersection between departing and landing users, we
refine the mapping with the intersection of the departing and landing
addresses. The locations are disjoint, and very likely a user will be
at only two locations in two days. Thus the intersection of the users
departing from $L_i$ with the users seen at $L_j$ on days $t_0$ and
$t_1$, $S_{L_i}^{t_0} \cap \bigl(\bigcup_{n=0,1} S_{L_j}^{t_n}\bigr) \neq \emptyset$,
will be true only for one $j$, and then the mappings are disjoint.
Because we are considering two consecutive days, we may have to
combine two or more mappings one step further. Let us introduce the
product of mappings.

By definition, if we have a mapping $(A : M)$, any user in $A$ can be
mapped to any address in $M$. As a graph, this represents a fully
connected bipartite graph. If we have another mapping $(B : N)$ and
there is an intersection between addresses, then we know that the
same users should be in both $A$ and $B$. After all, the address is
unique to the device. It makes sense to take the intersection of
users as well. In practice, we assume users will likely and
consistently keep the same user identification number. This refines
the mapping, reducing the size of the three resulting mappings:

(2.5)  $(A : M) * (B : N) = (A \setminus B : M \setminus N) + (A \cap B : M \cap N) + (B \setminus A : N \setminus M)$


We use the $*$ operator to represent this operation. In combination
with the $+$ operator, our algorithm will be based on an algebra. If
there is no intersection, there is no refinement and
$(A : M) * (B : N) = (A : M) + (B : N)$.
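
The product of two mappings can be sketched directly from Equation 2.5. This is a minimal illustration, not the implementation of Section 4: the function name `pair_product` and the sample identifiers are ours, and we assume that a term with an empty user set or an empty address set carries no information and is dropped.

```r
# Product of two mappings following Equation 2.5 (illustrative sketch).
# Terms with an empty user set or an empty address set are dropped.
pair_product <- function(a, b) {
  terms <- list(
    list(S = setdiff(a$S, b$S),   M = setdiff(a$M, b$M)),    # (A \ B : M \ N)
    list(S = intersect(a$S, b$S), M = intersect(a$M, b$M)),  # (A n B : M n N)
    list(S = setdiff(b$S, a$S),   M = setdiff(b$M, a$M))     # (B \ A : N \ M)
  )
  Filter(function(m) length(m$S) > 0 && length(m$M) > 0, terms)
}

a <- list(S = c("u1", "u2", "u3"), M = c("m1", "m2"))
b <- list(S = c("u2", "u3", "u4"), M = c("m2", "m3"))
res <- pair_product(a, b)
print(length(res))   # number of surviving terms
print(res[[2]])      # the refined common part
```

When the operands share neither users nor addresses, the middle term is empty and the result is simply the two operands, which agrees with the no-refinement case stated above.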

1 This definition does not fully represent reality. For example, if
$M(x_0) = m_0$ is unique and $U(x_0)$ is not unique, say $u_0$ in
$(A : M)$ and $u_1$ in $(B : N)$, then in Equation 2.5 we have
$(u_0 : \emptyset) + (\emptyset : m_0) + (u_1 : \emptyset) = \emptyset$
instead of $(u_{0,1} : m_0)$. The definition of the $*$ operation
seeks a match that is deterministic and unique in time.

If there are only two mappings, the product is intuitive and the
final result is a disjoint mapping. Let us consider two mappings
composed of disjoint simpler mappings and their products
where $w_j = (A_j : M_j)$ and $v_i = (B_i : N_i)$. We will abuse
the set notation a little here:
We notice that the sum $\sum_{i} (A_j \setminus B_i : M_j \setminus N_i)$
is not disjoint because every term has in common the mapping:
also $w_j \setminus v_0$ and $w_j \setminus v_1$ have in common the
one above and $\sum_{i \geq 2} w_j \cap v_i$, which are already
included in the second term in Equation 2.7. Thus the first term in
Equation 2.7 becomes basically $w_j \setminus \bigl(\sum_{i} v_i\bigr)$.
Equation 2.8 represents a disjoint mapping.
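
One reading of the derivation above can be sketched in R: the product of two lists of disjoint mappings collects every pairwise intersection, then what is left of each $w_j$ once all $v_i$ are removed, and symmetrically for each $v_i$. The helper names and the sample data are ours, and the symmetric third group of terms is our assumption, taken from the symmetry of Equation 2.5.

```r
# Componentwise set operations on mappings (hypothetical helpers).
m_and  <- function(a, b) list(S = intersect(a$S, b$S), M = intersect(a$M, b$M))
m_diff <- function(a, b) list(S = setdiff(a$S, b$S),   M = setdiff(a$M, b$M))
m_all  <- function(D) list(S = unique(unlist(lapply(D, function(m) m$S))),
                           M = unique(unlist(lapply(D, function(m) m$M))))
nonempty <- function(m) length(m$S) > 0 && length(m$M) > 0

# Product of two lists of disjoint mappings W and V.
list_product <- function(W, V) {
  out <- list()
  for (w in W) for (v in V) out <- c(out, list(m_and(w, v)))  # w_j n v_i
  for (w in W) out <- c(out, list(m_diff(w, m_all(V))))       # w_j \ sum of v_i
  for (v in V) out <- c(out, list(m_diff(v, m_all(W))))       # v_i \ sum of w_j
  Filter(nonempty, out)                                       # drop empty terms
}

W <- list(list(S = c("u1", "u2"), M = c("m1", "m2")), list(S = "u3", M = "m3"))
V <- list(list(S = "u2", M = "m2"), list(S = c("u3", "u4"), M = c("m3", "m4")))
res <- list_product(W, V)
print(length(res))
```

In this small example every surviving term has one user and one address, so the product already yields matches.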

Let us return to Equation 2.2 (part two) and Equation 2.4,
especially how to combine the terms that have an intersection: we
can imagine that the index $j$ implies an order for the components:
the destinations $L_0, L_1, \ldots, L_{N-1}$. At the end of the first
day $t_0$ we can summarize our knowledge as


(2.9) 
For the refinement in Equation 2.9 we have a disjoint set of
mappings. Now we must combine them. The first term in Equation 2.2
(part one) presents disjoint mappings and thus can simply be added in
Equation 2.9. The second term is a little trickier. We see it as the
intersection of mappings that have common users, thus narrowing the
mapping size. The second term should reduce to a perfect matching,
and when it does we can remove the users and put them aside.
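
Setting the perfect matches aside can be sketched as a simple partition of the list of mappings. The function name `split_matches` and the sample data are ours; we assume a match is a mapping with exactly one user and one address.

```r
# Separate perfect matches (one user, one address) from the mappings
# that still need refinement (illustrative sketch).
split_matches <- function(D) {
  is_match <- vapply(D, function(m) length(m$S) == 1 && length(m$M) == 1,
                     logical(1))
  list(matches = D[is_match], mappings = D[!is_match])
}

D <- list(
  list(S = "u1",            M = "m1"),
  list(S = c("u2", "u3"),   M = c("m2", "m3"))
)
parts <- split_matches(D)
print(length(parts$matches))
print(length(parts$mappings))
```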

Now let us consider the second day $t_1$. Let us compute $D_{t_1}$
independently from the previous step. Then we join the two steps by
checking user intersections and refining the mappings: for each
mapping in $D_{t_1}$ we can make a product/intersection with each
mapping in $D_{t_0}$ and thus:

$D_{t_1} = D_{t_1} * D_{t_0}$


Note the symmetric property of the product. Before any product or
update, the terms are a list of disjoint mappings. The product is
meant to combine mappings that have common addresses so as to refine
the mappings into matches.
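
As a small worked instance of the daily join, assume (purely for illustration) that each day holds a single mapping; the user and address labels are invented. Applying Equation 2.5 with $D_{t_1}$ as the first operand gives:

```latex
\begin{aligned}
D_{t_0} &= (\{u_1,u_2\} : \{m_1,m_2\}), \qquad
D_{t_1} = (\{u_2,u_3\} : \{m_2,m_3\}),\\
D_{t_1} * D_{t_0} &= (\{u_3\} : \{m_3\}) + (\{u_2\} : \{m_2\}) + (\{u_1\} : \{m_1\}).
\end{aligned}
```

Swapping the operands produces the same three terms in reverse order, and here all three are matches.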

We should keep an order during the concatenation, for 
example: 

3 A Study in Parallelism

The daily mappings $(S^{t_i} : M^{t_i})$ require the data from two
consecutive days: $t_i$ and $t_{i+1}$. The first parallel computation
is based on splitting the interval of time into smaller, consecutive
intervals: two-week intervals, say. We compute each two-week interval
in parallel. This is embarrassing parallelism.
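
The split of the observation period into consecutive intervals can be sketched as follows. The function name `chunk_days` and the 56-day example are ours; only the two-week width comes from the text.

```r
# Split a sequence of day indices into consecutive chunks of `width`
# days (14 for two-week intervals); names are illustrative.
chunk_days <- function(n_days, width = 14) {
  days <- seq_len(n_days)
  split(days, ceiling(days / width))
}

chunks <- chunk_days(56)
print(length(chunks))        # number of independent intervals
print(range(chunks[[2]]))    # first and last day of the second interval
```

Each chunk can then be handed to its own worker, since no chunk needs the result of another until the products are combined.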

The total interval of time is composed of six months of data; we
actually split the computation into up to 15 independent
computations. Each $D_t$ is composed of a set of matches and
mappings. We take the list of $D_{t_i}$ and compute consecutive-pair
products as a binary tree.

Obviously, the last computation in the binary tree is a single
product and it seems that there is no parallelism to exploit. Take
the example in Figure 1. The final product
$D_{56} = D_{28} * D_{56}$ will require at least as much as the sum of
the previous computations, $O(D_{14} * D_{28}) + O(D_{42} * D_{56})$,
which does not seem parallel friendly.

In practice, as we go up in the tree, we lose explicit parallelism
but we can exploit the same amount of parallelism inside the product.
Thus, we can keep the same level of parallelism throughout the
computation and make efficient use of any architecture.
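
The binary-tree combination can be sketched with a generic pairwise reduction. The function `tree_reduce` and the symbolic product `label` are ours: `label` only records the shape of the tree, and any binary product could be passed in its place. The sketch assumes a non-empty list of operands.

```r
# Combine a list of operands pairwise, level by level, as a binary tree.
# `f` is any binary product; an odd operand is carried to the next level.
tree_reduce <- function(xs, f) {
  while (length(xs) > 1) {
    idx <- seq(1, length(xs), by = 2)
    xs <- lapply(idx, function(i) {
      if (i < length(xs)) f(xs[[i]], xs[[i + 1]]) else xs[[i]]
    })
  }
  xs[[1]]
}

# Symbolic product that only records the shape of the tree.
label <- function(a, b) paste0("(", a, " * ", b, ")")
print(tree_reduce(list("D14", "D28", "D42", "D56"), label))
```

The products within one level are independent of each other, so the `lapply` over a level is the natural place for explicit parallelism; near the root only one product remains and the parallelism has to come from inside the product itself.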

The product becomes more complex as we go up. In fact, the product
has to explore a Cartesian product of the operand mappings (graphs).
We explore whether there are edges across the operands, and this is
why we use the term adaptive for the graph we explore and build.

Figure 1: Decomposition of the computation


4 The Implementation 


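# R implementation
# Product of two lists of mappings D0 and D1. Each mapping is a list
# with a user set S and an address set M; P is the number of cores.
# group2 is a helper not shown here; it is assumed to split the index
# vector into chunks of about length(D0)/P elements.
library(parallel)   # provides mclapply

product <- function(D0, D1, P=2) {
  if (length(D0)==0 && length(D1)==0) { R = list() }
  else if ((is.null(D0) || length(D0)==0) && length(D1)>0) { R = D1 }
  else if (length(D0)>0 && (is.null(D1) || length(D1)==0)) { R = D0 }
  else {
    L = group2(1:length(D0), length(D0)/P)
    # Worker: refine the mappings D0[K] against every mapping in D1.
    ii <- function(K) {
      i = 0; R = list(); D = list('S'=c(), 'M'=c())
      for (k in K) {
        l = i                               # terms recorded before this k
        Q = list('S'=c(), 'M'=c())          # users/addresses of D0[[k]] found in D1
        for (j in 1:length(D1)) {
          S = intersect(D0[[k]]$S, D1[[j]]$S)
          M = intersect(D0[[k]]$M, D1[[j]]$M)
          # Record an intersection only if both sides are non-empty.
          if (length(M)>0 && length(S)>0) {
            i = i+1; R[[i]] = list('S'=S, 'M'=M)
            Q$M = union(Q$M, M)
            Q$S = union(Q$S, S)
          }
        }
        if (i>l) {
          # Some intersection was found: keep what is left of D0[[k]].
          S = setdiff(D0[[k]]$S, Q$S)
          M = setdiff(D0[[k]]$M, Q$M)
          if (length(M)>0 && length(S)>0) {
            i = i+1; R[[i]] = list('S'=S, 'M'=M)
          }
          D$M = union(D$M, Q$M)
          D$S = union(D$S, Q$S)
        } else {
          # No intersection: D0[[k]] is kept unchanged.
          i = i+1; R[[i]] = D0[[k]]
        }
      }
      list("R"=R, "D"=D, "disjoint"=(i==0))
    }
    RT = mclapply(L, ii, mc.preschedule=TRUE, mc.cores=P)

    # Merge the partial results of the workers.
    i = 0; R = list(); D = list('S'=c(), 'M'=c()); disjoint = TRUE
    for (rt in RT) {
      if (length(rt)>0) {
        for (r in rt$R) {
          i = i+1; R[[i]] = r
        }
        disjoint = disjoint && rt$disjoint
        D$M = union(D$M, rt$D$M)
        D$S = union(D$S, rt$D$S)
      }
    }
    if (disjoint) {
      # Nothing was recorded from D0: append D1 as it is.
      for (k in 1:length(D1)) {
        i = i+1; R[[i]] = D1[[k]]
      }
    } else {
      # Append what is left of each D1 mapping after removing the
      # users and addresses already placed in an intersection.
      for (k in 1:length(D1)) {
        S = setdiff(D1[[k]]$S, D$S)
        M = setdiff(D1[[k]]$M, D$M)
        if (length(M)>0 && length(S)>0) {
          i = i+1; R[[i]] = list('S'=S, 'M'=M)
        }
      }
    }
  }
  R   # the combined list of mappings
}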
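
The listing relies on a helper `group2` that is not shown. A hypothetical stand-in, written only from the way it is called (an index vector and a chunk size), could look like the sketch below; the actual helper may behave differently.

```r
# Hypothetical stand-in for the helper group2 used by product():
# split the index vector x into consecutive chunks of about `size`
# elements. The real helper is not shown in the paper.
group2 <- function(x, size) {
  size <- max(1, ceiling(size))
  unname(split(x, ceiling(seq_along(x) / size)))
}

L <- group2(1:10, 10 / 4)
print(lengths(L))   # sizes of the chunks handed to the workers
```

Rounding the chunk size up keeps every index in exactly one chunk, at the price of a shorter last chunk when the length is not a multiple of the size.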


The product is the core of the whole computation. The software came
after the formal solution was found. The first implementation was
verbatim from Equation 2.8. This was a good starting point. There
are actually a few drawbacks. First, the intersections are sparse
and not balanced; that is, there may be an intersection between the
$S_i$ but not between the $M_i$, or vice versa. This means that the
computation over the pairs $w_j$, $v_i$ will spend quite some work
finding empty intersections, and this information is not used for the
other terms. Abusing the notation a little, we can rewrite the
computation in such a way that we avoid the graph union computations
by computing the intersection first and reusing it:

(4.11) 
Above, we present the implementation in R of the product. The
operand D0 holds the mappings at time $t_i$ ($t_0$) and the operand
D1 holds the mappings at time $t_{i+1}$ ($t_1$).

The implementation has an important difference from Equation 2.8
and from the intent of Equation 4.11: the intersection has to be
non-empty for both S and M to be recorded and used (in the following
set difference computation). If we could apply Equation 2.8, the
final product would be the composition of disjoint terms. The
implementation of Equation 4.11 does not ensure that the final
product has disjoint terms, and it may actually allow identical terms
to appear, owing to the symmetric nature of the graph. Our
implementation allows a minimal and consistent computation: equal
terms are removed and non-disjoint terms involving perfect matches
are simplified.
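
Removing equal terms can be sketched as follows. The function name `dedup_terms` is ours, and the sketch assumes that identifiers contain neither a comma nor a vertical bar, since both are used to build the comparison key.

```r
# Drop repeated terms from a list of mappings. Two terms are taken to
# be equal when they hold the same users and the same addresses,
# regardless of order (illustrative sketch).
dedup_terms <- function(D) {
  key <- vapply(D, function(m) {
    paste(paste(sort(m$S), collapse = ","),
          paste(sort(m$M), collapse = ","), sep = "|")
  }, character(1))
  D[!duplicated(key)]
}

D <- list(
  list(S = c("u1", "u2"), M = "m1"),
  list(S = c("u2", "u1"), M = "m1"),   # same term, users in another order
  list(S = "u3",          M = "m2")
)
print(length(dedup_terms(D)))
```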

As we can see, the product explores all pairs to find intersections,
a quadratic effect on the computational complexity. Our
implementation choice is based on reducing the complexity, even if
only by a constant factor.
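
As a rough count, and assuming the indices of the first operand are split evenly among $P$ workers as in the listing, the number of pairs explored by one product is:

```latex
\text{pairs explored} = |D_0| \cdot |D_1|,
\qquad
\text{pairs per worker} \approx \frac{|D_0|}{P} \cdot |D_1| .
```

Parallelism divides the work per worker but leaves the total number of pairs unchanged, which is why the growth remains quadratic.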

5 The Case Study 

Due to the proprietary nature of the data, we cannot share the 
set itself and a few of its details. However, we share the code 
verbatim because of its simplicity (and will share the code 
upon request). 

We observed about six airports for about six months. 
We observed about five hundred thousand unique MACs that 
appear more than once (if there is only one appearance, there 
is very little signal and a matching will be possible only if we 
match all other MACs, which is unlikely). 

We observed nine million unique users collected within a radius of
one mile from the airports' requested center of interest, each of
whom appeared more than two times during the entire period. On
average, we have 2.5 million unique users and ten thousand unique
MACs per day.

We build the mappings using two different granularities (see
Figure 1): we use a granularity of two and of four weeks to start the
computation. This is to cope with the randomness of the user
observation: we can only obtain a sample of the users' UUIDs, and
their presence or absence affects the matches and their products.
Also, the asymmetric nature of the product implementation,
exemplified by Equation 4.11, will make the resulting graphs
different. Otherwise, the graphs should be completely deterministic,
consistently built, and eventually identical.

Table 1: Two/four-week mapping graph (above) and user/mac distribution (below)

weeks   matches   mappings   users covered   MACs covered
2       30667     130304     23448155        689828
4       33912     126650     24153536        686038

Table 2: Ratio user/mac distribution

weeks   Min.    1st Qu.   Median   Mean     3rd Qu.   Max.
2       0.007   3.000     8.000    80.990   30.000    28690.000
4       0.012   3.000     8.000    87.940   32.000    27440.000

The computation time also may differ because of the 
different sparsity of the mappings and their combinations. 
We present the results separately and we conclude this 
section with a few considerations. 

5.1 The two and four week graphs The process follows the one
presented in Figure 1: we start by building the day-by-day graph up
to two/four weeks. Then, we build the full graph.

Using the same amount of resources, 16 cores for each computation,
the four-week graph is a little faster (i.e., 2 hours faster for a
4-day end-to-end computation), and it provides more matches and fewer
mappings but more redundancy. Notice that the two-week graph exploits
more parallelism and it would be faster if more resources could be
used at the beginning of the process.

If we take a graph and compute the ratio of the number of users over
the number of MACs in each mapping, we can summarize the graphs using
their distributions. We summarize their distributions in Table 2. In
practice, the two-week graph has fewer matches but its mappings tend
to be more refined than the mappings in the four-week graph
(Table 1).
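
The ratio summary can be sketched with R's `summary()`, which reports the same six statistics used as column headers in Table 2. The function name `user_mac_ratio` and the three sample mappings are ours and bear no relation to the proprietary data.

```r
# Ratio of users to MACs for each mapping, then the minimum, quartiles,
# median, mean, and maximum as printed by summary().
user_mac_ratio <- function(D) {
  vapply(D, function(m) length(m$S) / length(m$M), numeric(1))
}

D <- list(
  list(S = c("u1", "u2", "u3"), M = c("m1")),
  list(S = c("u4"),             M = c("m2", "m3")),
  list(S = c("u5", "u6"),       M = c("m4"))
)
r <- user_mac_ratio(D)
print(summary(r))
```

A ratio of 1 with a single user and a single MAC corresponds to a match; ratios far from 1 indicate mappings that remain coarse on one side.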

We have not tried to combine the results (four and two weeks); it
is possible to take the graphs, combine the matches, and then
compute the product of the mappings. This is left as future
investigation.

5.2 Considerations The problem formulation and its notation were
used to write a first implementation: the first prototype was applied
to a small graph after a few weeks. We chose to write the solution in
R for the ease of connecting to the different databases where the
data were available. The simple semantics of the language fit the
original formulation well.

As we increased the size of the graph, we decided to keep the
original solution, exploit the R parallelism, and beef up the
hardware (from an 8-core 32 GB machine to a 32-core 128 GB machine).
However, the quadratic complexity (i.e., $O(N^2)$ in the nodes of the
graph) forced us to tune the code and to relax the computation. We
had to exploit parallelism in a way that is not natural to R.

We would suggest choosing a different environment or language to
exploit parallelism at the loop level.

6 Conclusion 

To the best of our knowledge, the problem is novel because 
the refinement of the mappings requires the intersection of 
two different sets. There is no truth given a priori and thus 
there is no learning. This is an example of graph mining. To 
the best of our knowledge, our solution is novel as well. 

We provide a formal definition of the problem and its solution in
order to start a conversation. The formal statement has actually
been driving our problem presentation and solution. The desire for a
well-defined formalism helped us free our ideas from ambiguity and
from dangerous and misplaced intuitions.

As a result, we have a solution that balances parallelism in an
elegant fashion as it unfolds during the computation. This kind of
parallelism is not common and we wanted to share its application.
This, in itself, could be attractive to others.


















